AITopics | online data

DesignofExperiments forStochasticContextualLinearBandits

Neural Information Processing SystemsFeb-10-2026, 23:38:06 GMT

In the stochastic linear contextual bandit setting there exist several minimax procedures for exploration with policies that are reactive to the data being acquired.

data mining, machine learning, reinforcement learning, (20 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Santa Clara County > Stanford (0.05)
North America > United States > California > Santa Clara County > Palo Alto (0.05)
North America > United States > California > Alameda County > Berkeley (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.48)
Information Technology > Data Science > Data Mining > Big Data (0.46)

Add feedback

Online Label Shift: Optimal Dynamic Regret meets Practical Algorithms

Neural Information Processing SystemsDec-26-2025, 19:52:06 GMT

This paper focuses on supervised and unsupervised online label shift,where the class marginals $Q(y)$ variesbut the class-conditionals $Q(x|y)$ remain invariant. In the unsupervised setting, our goal is to adapt a learner, trained on some offline labeled data, to changing label distributions given unlabeled online data. In the supervised setting, we must both learn a classifier and adapt to the dynamically evolving class marginals given only labeled online data. We develop novel algorithms that reduce the adaptation problem to online regression and guarantee optimal dynamic regret without any prior knowledge of the extent of drift in the label distribution. Our solution is based on bootstrapping the estimates of *online regression oracles* that track the drifting proportions. Experiments across numerous simulated and real-world online label shift scenarios demonstrate the superior performance of our proposed approaches, often achieving 1-3% improvement in accuracy while being sample and computationally efficient.

dynamic regret meet practical algorithm, name change, online label shift, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.78)

Add feedback

Reward-agnostic Fine-tuning: Provable Statistical Benefits of Hybrid Reinforcement Learning

Neural Information Processing SystemsDec-26-2025, 13:38:09 GMT

This paper studies tabular reinforcement learning (RL) in the hybrid setting, which assumes access to both an offline dataset and online interactions with the unknown environment. A central question boils down to how to efficiently utilize online data to strengthen and complement the offline dataset and enable effective policy fine-tuning. Leveraging recent advances in reward-agnostic exploration and offline RL, we design a three-stage hybrid RL algorithm that beats the best of both worlds --- pure offline RL and pure online RL --- in terms of sample complexities. The proposed algorithm does not require any reward information during data collection.

hybrid reinforcement learning, provable statistical benefit, reward-agnostic fine-tuning, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.65)

Add feedback

Offline-to-Online Reinforcement Learning with Classifier-Free Diffusion Generation

Huang, Xiao, Liu, Xu, Zhang, Enze, Yu, Tong, Li, Shuai

arXiv.org Artificial IntelligenceAug-12-2025

Offline-to-online Reinforcement Learning (O2O RL) aims to perform online fine-tuning on an offline pre-trained policy to minimize costly online interactions. Existing work used offline datasets to generate data that conform to the online data distribution for data augmentation. However, generated data still exhibits a gap with the online data, limiting overall performance. To address this, we propose a new data augmentation approach, Classifier-Free Diffusion Generation (CFDG). Without introducing additional classifier training overhead, CFDG leverages classifier-free guidance diffusion to significantly enhance the generation quality of offline and online data with different distributions. Additionally, it employs a reweighting method to enable more generated data to align with the online data, enhancing performance while maintaining the agent's stability. Experimental results show that CFDG outperforms replaying the two data types or using a standard diffusion model to generate new data. Our method is versatile and can be integrated with existing offline-to-online RL algorithms. By implementing CFDG to popular methods IQL, PEX and APL, we achieve a notable 15% average improvement in empirical performance on the D4RL benchmark such as MuJoCo and AntMaze.

artificial intelligence, machine learning, reinforcement learning, (14 more...)

arXiv.org Artificial Intelligence

2508.06806

Country: North America (0.28)

Genre:

Instructional Material > Online (0.63)
Research Report > New Finding (0.48)

Industry: Education > Educational Setting > Online (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Actor-Critic based Online Data Mixing For Language Model Pre-Training

Ma, Jing, Dang, Chenhao, Liao, Mingjie

arXiv.org Artificial IntelligenceJun-2-2025

The coverage and composition of pretraining data significantly impacts the generalization ability of Large Language Models (LLMs). To reduce the carbon footprint and financial costs of training, some data mixing methods, which applied the optimized domain weights of a small proxy model to train a larger one, were proposed. However, these methods did not evolute with the training dynamics. The existing online data mixing (ODM) method addressed this limitation by applying the multi-armed bandit algorithm as data sampling strategy. Yet, it did not consider the intra-domain interactions. In this paper, we develop an actor-critic based online data mixing (AC-ODM) method, which captures the varying domain weights by auxiliary actor-critic networks and consider the intra-domain interactions with the reward function. While constructing the dataset to pretrain a large target LLM, we directly apply the actor, which is trained with a small proxy LLM as the environment, as the sampling strategy. The transfer of sampling strategy can not only ensure the efficiency of dynamical data mixing, but also expedite the convergence of pretraining the target LLM. Numerical results demonstrate that AC-ODM-410M, which invokes the sampling strategy obtained by a proxy LLM with 410M parameters, reaching the optimal validation perplexity of ODM 71% faster, and improves performance on the zero-shot MMLU benchmark by 27.5% of accuracy, about 2.23x better on pass@1 of HumanEval benchmark.

ac-odm, large language model, natural language, (18 more...)

arXiv.org Artificial Intelligence

2505.23878

Genre: Research Report (0.73)

Industry: Energy (0.34)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Best Arm Identification with Possibly Biased Offline Data

Yang, Le, Tan, Vincent Y. F., Cheung, Wang Chi

arXiv.org Artificial IntelligenceMay-30-2025

W e study the best arm identification (BAI) problem with potentially biased offline data in the fixed confidence setting, which commonly arises in real-world scenarios such as clinical trials. W e prove an impossibility result for adaptive algorithms without prior knowledge of the bias bound between online and offline distributions. To address this, we propose the LUCB-H algorithm, which introduces adaptive confidence bounds by incorporating an auxiliary bias correction to balance of-fline and online data within the LUCB framework. Theoretical analysis shows that LUCB-H matches the sample complexity of standard LUCB when offline data is misleading and significantly outperforms it when offline data is helpful. W e also derive an instance-dependent lower bound that matches the upper bound of LUCB-H in certain scenarios. Numerical experiments further demonstrate the robustness and adaptability of LUCB-H in effectively incorporating offline data.

artificial intelligence, machine learning, offline data, (16 more...)

arXiv.org Artificial Intelligence

2505.23165

Country: Asia > Singapore (0.15)

Genre: Research Report > New Finding (0.66)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.37)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.67)

Add feedback

Reward-agnostic Fine-tuning: Provable Statistical Benefits of Hybrid Reinforcement Learning

Neural Information Processing SystemsMay-27-2025, 08:12:03 GMT

This paper studies tabular reinforcement learning (RL) in the hybrid setting, which assumes access to both an offline dataset and online interactions with the unknown environment. A central question boils down to how to efficiently utilize online data to strengthen and complement the offline dataset and enable effective policy fine-tuning. Leveraging recent advances in reward-agnostic exploration and offline RL, we design a three-stage hybrid RL algorithm that beats the best of both worlds --- pure offline RL and pure online RL --- in terms of sample complexities. The proposed algorithm does not require any reward information during data collection. Our theory is developed based on a new notion called single-policy partial concentrability, which captures the trade-off between distribution mismatch and miscoverage and guides the interplay between offline and online data.

artificial intelligence, hybrid reinforcement learning, machine learning, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

The Importance of Online Data: Understanding Preference Fine-tuning via Coverage

Neural Information Processing SystemsMay-26-2025, 17:12:30 GMT

Learning from human preference data has emerged as the dominant paradigm for fine-tuning large language models (LLMs). The two most common families of techniques -- online reinforcement learning (RL) such as Proximal Policy Optimization (PPO) and offline contrastive methods such as Direct Preference Optimization (DPO) -- were positioned as equivalent in prior work due to the fact that both have to start from the same offline preference dataset. To further expand our theoretical understanding of the similarities and differences between online and offline techniques for preference fine-tuning, we conduct a rigorous analysis through the lens of dataset coverage, a concept that captures how the training data covers the test distribution and is widely used in RL. We prove that a global coverage condition is both necessary and sufficient for offline contrastive methods to converge to the optimal policy, but a weaker partial coverage condition suffices for online RL methods. This separation provides one explanation of why online RL methods can perform better than offline methods, especially when the offline preference data is not diverse enough.

large language model, machine learning, natural language, (8 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.61)

Add feedback

Online Label Shift: Optimal Dynamic Regret meets Practical Algorithms

Neural Information Processing SystemsJan-19-2025, 22:48:12 GMT

This paper focuses on supervised and unsupervised online label shift,where the class marginals Q(y) variesbut the class-conditionals Q(x y) remain invariant. In the unsupervised setting, our goal is to adapt a learner, trained on some offline labeled data, to changing label distributions given unlabeled online data. In the supervised setting, we must both learn a classifier and adapt to the dynamically evolving class marginals given only labeled online data. We develop novel algorithms that reduce the adaptation problem to online regression and guarantee optimal dynamic regret without any prior knowledge of the extent of drift in the label distribution. Our solution is based on bootstrapping the estimates of *online regression oracles* that track the drifting proportions.

dynamic regret meet practical algorithm, label distribution, online label shift, (1 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.83)

Add feedback

Reward-agnostic Fine-tuning: Provable Statistical Benefits of Hybrid Reinforcement Learning

Neural Information Processing SystemsJan-19-2025, 19:05:51 GMT

This paper studies tabular reinforcement learning (RL) in the hybrid setting, which assumes access to both an offline dataset and online interactions with the unknown environment. A central question boils down to how to efficiently utilize online data to strengthen and complement the offline dataset and enable effective policy fine-tuning. Leveraging recent advances in reward-agnostic exploration and offline RL, we design a three-stage hybrid RL algorithm that beats the best of both worlds --- pure offline RL and pure online RL --- in terms of sample complexities. The proposed algorithm does not require any reward information during data collection. Our theory is developed based on a new notion called single-policy partial concentrability, which captures the trade-off between distribution mismatch and miscoverage and guides the interplay between offline and online data.

hybrid reinforcement learning, provable statistical benefit, reward-agnostic fine-tuning, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Filters

Collaborating Authors

online data

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

DesignofExperiments forStochasticContextualLinearBandits

Online Label Shift: Optimal Dynamic Regret meets Practical Algorithms

Reward-agnostic Fine-tuning: Provable Statistical Benefits of Hybrid Reinforcement Learning

Offline-to-Online Reinforcement Learning with Classifier-Free Diffusion Generation

Actor-Critic based Online Data Mixing For Language Model Pre-Training

Best Arm Identification with Possibly Biased Offline Data

Reward-agnostic Fine-tuning: Provable Statistical Benefits of Hybrid Reinforcement Learning

The Importance of Online Data: Understanding Preference Fine-tuning via Coverage

Online Label Shift: Optimal Dynamic Regret meets Practical Algorithms

Reward-agnostic Fine-tuning: Provable Statistical Benefits of Hybrid Reinforcement Learning